Extracts UTF-8 sequence validation to BitString class #717

Draft
mward-sudo wants to merge 4 commits into bartblast:dev from mward-sudo:02-18-extracts_utf-8_sequence_validation_to_bitstring_class

Conversation

Contributor

@mward-sudo mward-sudo commented Feb 18, 2026

Closes #713

Dependencies

Please note that this PR includes commits from the PR(s) it is dependent upon. Once the dependent PR(s) are merged to the dev branch, then this PR will be rebased and will then only contain its own commits. This PR will remain in draft until that point.

Summary by CodeRabbit

  • New Features

    • Added UTF-8 decoding and validation helpers to the core bitstring utilities.
  • Refactor

    • Consolidated UTF-8 validation logic across unicode handlers to use shared utilities, reducing duplication.
  • Tests

    • Added comprehensive UTF-8 decoding and validation tests covering multi-byte sequences, invalid bytes, overlong encodings, surrogates, and boundary cases.

coderabbitai bot commented Feb 18, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds four static UTF‑8 helpers to Bitstring and refactors assets/js/erlang/unicode.mjs to use them; also adds tests covering the new helpers and integration checks. Changes are mechanical replacement of inlined UTF‑8 decoding/validation with centralized routines.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bitstring UTF‑8 helpers: assets/js/bitstring.mjs | Added static methods: decodeUtf8CodePoint(bytes, start, length), isValidUtf8CodePoint(codePoint, encodingLength), isValidUtf8ContinuationByte(byte), and isValidUtf8Sequence(bytes, start, length) for UTF‑8 decoding/validation. |
| Unicode module refactor: assets/js/erlang/unicode.mjs | Replaced multiple inline UTF‑8 decoding/validation code paths with calls to the new Bitstring helpers across normalization/character-to-binary logic; adjusted truncation/flow to rely on centralized validation. |
| Tests for UTF‑8 helpers: test/javascript/bitstring_test.mjs | Added extensive tests for the new helpers and added integration checks (valid/invalid sequences, continuation bytes, overlongs, surrogates, out‑of‑range). All additions are test-only. |
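
The decoding and validation rules summarized above can be sketched as follows. This is a minimal standalone sketch, not the PR's code: the function names mirror the ones listed in the summary, but the bodies and the two lookup tables are assumptions written from the standard UTF-8 rules (continuation bytes match 10xxxxxx, no overlong encodings, no surrogates, maximum U+10FFFF). It also assumes that `length` was already derived from the leader byte.

```javascript
// Hypothetical standalone versions of the helpers named in the summary.
// Assumes `bytes` is an array-like of integers 0-255 and that `length`
// (1-4) was derived from the leader byte beforehand.

// Mask that keeps the payload bits of the leader byte, indexed by length.
const FIRST_BYTE_MASKS = [0, 0x7f, 0x1f, 0x0f, 0x07];
// Smallest code point that legitimately needs `length` bytes.
const MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];

function isValidUtf8ContinuationByte(byte) {
  return (byte & 0xc0) === 0x80; // 10xxxxxx
}

function decodeUtf8CodePoint(bytes, start, length) {
  let codePoint = bytes[start] & FIRST_BYTE_MASKS[length];
  for (let i = 1; i < length; i++) {
    // Each continuation byte contributes its low 6 bits.
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }
  return codePoint;
}

function isValidUtf8CodePoint(codePoint, encodingLength) {
  if (codePoint < MIN_CODE_POINT[encodingLength]) return false; // overlong
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate
  return codePoint <= 0x10ffff; // beyond the Unicode range otherwise
}

function isValidUtf8Sequence(bytes, start, length) {
  if (start + length > bytes.length) return false; // truncated sequence
  for (let i = 1; i < length; i++) {
    if (!isValidUtf8ContinuationByte(bytes[start + i])) return false;
  }
  return isValidUtf8CodePoint(decodeUtf8CodePoint(bytes, start, length), length);
}

// "€" (U+20AC) is encoded as E2 82 AC.
console.log(decodeUtf8CodePoint([0xe2, 0x82, 0xac], 0, 3).toString(16)); // 20ac
console.log(isValidUtf8Sequence([0xe2, 0x82, 0xac], 0, 3)); // true
// C0 80 is an overlong encoding of U+0000.
console.log(isValidUtf8Sequence([0xc0, 0x80], 0, 2)); // false
// ED A0 80 decodes to the surrogate U+D800.
console.log(isValidUtf8Sequence([0xed, 0xa0, 0x80], 0, 3)); // false
```

Like the routine discussed in the review, this sketch does not check the leader byte itself; it relies on the caller having chosen `length` from it.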

Sequence Diagram(s)

(omitted — changes are internal helper extraction and local refactor; no multi‑component sequential flow warranted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: extraction of UTF-8 sequence validation logic into the BitString class. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #713 are met: UTF-8 validation logic centralized in BitString class with methods for decoding, validation, and sequence handling. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: UTF-8 helper methods added to Bitstring, refactored unicode.mjs to use these helpers, and comprehensive test coverage added. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
assets/js/erlang/unicode.mjs (1)

325-340: Four identical findValidUtf8Length closures — extract to a shared module-level helper

After this PR all four normalization functions (characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, characters_to_nfkd_binary/1) define findValidUtf8Length with byte-for-byte identical bodies. Lifting it to a module-level function would eliminate the duplication.

♻️ Proposed extraction (to be placed before the `Erlang_Unicode` object)
+// Scans forward to find the longest valid UTF-8 prefix of `bytes`.
+// Returns the byte offset past the last valid sequence.
+// Time complexity: O(n).
+const findValidUtf8Length = (bytes) => {
+  let pos = 0;
+  while (pos < bytes.length) {
+    const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
+    if (seqLength === false || !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength))
+      break;
+    pos += seqLength;
+  }
+  return pos;
+};
+
 const Erlang_Unicode = {

Then remove all four local findValidUtf8Length declarations and reference the shared one.

Also applies to: 579-594, 686-701, 792-806

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 325 - 340, Extract the duplicated
closure findValidUtf8Length into a single module-level helper placed before the
Erlang_Unicode object and remove the four local definitions inside
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 so they reference
the shared helper; keep the exact validation logic (use
Bitstring.getUtf8SequenceLength and Bitstring.isValidUtf8Sequence) and ensure
callers still pass the same bytes array so behavior is unchanged.
assets/js/bitstring.mjs (1)

259-259: Object literals allocated on every call in hot-path functions — consider static class fields

Both firstByteMasks (line 259) and minValueForLength (line 602) construct a new object on every invocation. These functions are called per-byte during UTF-8 validation, making them hot paths. Promoting these to private static class fields (arrays, indexed by sequence length) eliminates per-call allocation:

♻️ Proposed refactor — static array fields
 export default class Bitstring {
   static #decoder = ERTS.utf8Decoder;
   static #encoder = new TextEncoder("utf-8");
+  // Indexed by UTF-8 sequence length (1–4); unused positions are 0.
+  static #UTF8_FIRST_BYTE_MASKS = [0, 0, 0x1f, 0x0f, 0x07];
+  static #UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];
   static decodeUtf8CodePoint(bytes, start, length) {
     if (length === 1) return bytes[start];
-
-    // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-    const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-    let codePoint = bytes[start] & firstByteMasks[length];
+    let codePoint = bytes[start] & $.#UTF8_FIRST_BYTE_MASKS[length];
   static isValidUtf8CodePoint(codePoint, encodingLength) {
-    const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-
-    if (codePoint < minValueForLength[encodingLength]) return false;
+    if (codePoint < $.#UTF8_MIN_CODE_POINT[encodingLength]) return false;

Also applies to: 602-602

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` at line 259, The object literals firstByteMasks and
minValueForLength are being reallocated on every hot-path call; move them out of
the functions and into private static class fields (e.g.,
Bitstring.#firstByteMasks and Bitstring.#minValueForLength) as arrays indexed by
sequence length, then replace uses of the local objects with lookups into these
static fields to avoid per-call allocations (update any functions referencing
firstByteMasks and minValueForLength to use the static field names).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/js/bitstring.mjs`:
- Around line 619-634: isValidUtf8Sequence currently omits verification that
bytes[start] is a valid UTF‑8 leader for the given length; update the function's
doc comment (above static isValidUtf8Sequence) to explicitly state the
precondition that the leader byte must be pre-validated (e.g., by
getUtf8SequenceLength) and that this routine only checks bounds, continuation
bytes via $.isValidUtf8ContinuationByte and codepoint validity via
$.decodeUtf8CodePoint/$.isValidUtf8CodePoint; alternatively, if you prefer full
self‑validation, add an explicit leader‑byte check against the expected leader
bit pattern for the provided length before the continuation checks.

In `@assets/js/erlang/unicode.mjs`:
- Line 687: In characters_to_nfkc_binary/1 update the inline comment "// scan
forward, validating each sequence" to match the capitalization used in the other
findValidUtf8Length blocks (e.g. change to "// Scan forward, validating each
sequence") so the comment casing is consistent with the other occurrences in the
findValidUtf8Length implementations.
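
If the full self-validation option from the first inline comment were chosen, the leader-byte check could look roughly like the sketch below. The helper name `isValidUtf8LeaderByte` is hypothetical and does not appear in the PR; the bit patterns are the standard UTF-8 leader patterns for each sequence length.

```javascript
// Hypothetical helper (not part of the PR): checks that `byte` has the
// leader bit pattern that UTF-8 prescribes for a sequence of `length` bytes.
function isValidUtf8LeaderByte(byte, length) {
  switch (length) {
    case 1:
      return (byte & 0x80) === 0x00; // 0xxxxxxx
    case 2:
      return (byte & 0xe0) === 0xc0; // 110xxxxx
    case 3:
      return (byte & 0xf0) === 0xe0; // 1110xxxx
    case 4:
      return (byte & 0xf8) === 0xf0; // 11110xxx
    default:
      return false; // UTF-8 sequences are 1 to 4 bytes long
  }
}

console.log(isValidUtf8LeaderByte(0xe2, 3)); // true
console.log(isValidUtf8LeaderByte(0x82, 3)); // false (continuation byte)
console.log(isValidUtf8LeaderByte(0xe2, 2)); // false (length mismatch)
```

A check like this only confirms the bit pattern; overlong leaders such as 0xC0 still pass it and have to be rejected by the code point validation.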


@mward-sudo mward-sudo force-pushed the 02-18-extracts_utf-8_sequence_validation_to_bitstring_class branch from 073775f to b9bfe2d Compare February 18, 2026 23:50
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
assets/js/bitstring.mjs (1)

255-269: Per-call object literal in a hot-path inner function.

firstByteMasks is re-allocated on every invocation of decodeUtf8CodePoint, which is called from within the tight while loop in findValidUtf8Length in each of the Unicode normalization functions. Hoisting it, together with minValueForLength in isValidUtf8CodePoint (see below), to module-level constants eliminates the per-call heap allocation.

♻️ Proposed refactor: hoist lookup table to module scope
+// Pre-allocated lookup tables for UTF-8 decoding/validation (avoid per-call allocation)
+const _UTF8_FIRST_BYTE_MASKS = [0, 0xff, 0x1f, 0x0f, 0x07];
+const _UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];

 static decodeUtf8CodePoint(bytes, start, length) {
   if (length === 1) return bytes[start];

-  // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-  const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-  let codePoint = bytes[start] & firstByteMasks[length];
+  let codePoint = bytes[start] & _UTF8_FIRST_BYTE_MASKS[length];

   for (let i = 1; i < length; i++) {
     codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
   }

   return codePoint;
 }
 static isValidUtf8CodePoint(codePoint, encodingLength) {
-  const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-
-  if (codePoint < minValueForLength[encodingLength]) return false;
+  if (codePoint < _UTF8_MIN_CODE_POINT[encodingLength]) return false;
   if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
   if (codePoint > 0x10ffff) return false;
   return true;
 }

Place the two const declarations at the top of the module, outside the class but directly above it, so they stay close to the methods that use them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` around lines 255 - 269, Hoist the per-call lookup
tables into module-level constants to avoid allocating them on every
decodeUtf8CodePoint invocation: move the object firstByteMasks (used in
decodeUtf8CodePoint) — and likewise minValueForLength used by
isValidUtf8CodePoint if present — out of the function and define them once at
module scope, then update decodeUtf8CodePoint and isValidUtf8CodePoint to
reference those constants instead of creating new objects each call.
assets/js/erlang/unicode.mjs (1)

326-339: Four identical findValidUtf8Length bodies — consider extracting to Bitstring.

Now that the body is entirely composed of Bitstring static calls, the implementations in characters_to_nfc_binary/1, characters_to_nfd_binary/1, characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 are byte-for-byte identical. A single Bitstring.findValidUtf8PrefixLength(bytes) helper would eliminate the duplication and be consistent with the PR's overall goal of centralising UTF-8 logic.

♻️ Sketch of the extraction

In assets/js/bitstring.mjs:

+  // Scans bytes forward and returns the length of the longest valid UTF-8 prefix.
+  static findValidUtf8PrefixLength(bytes) {
+    let pos = 0;
+    while (pos < bytes.length) {
+      const seqLength = $.getUtf8SequenceLength(bytes[pos]);
+      if (seqLength === false || !$.isValidUtf8Sequence(bytes, pos, seqLength))
+        break;
+      pos += seqLength;
+    }
+    return pos;
+  }

Then each findValidUtf8Length local inside the NFC/NFD/NFKC/NFKD functions collapses to:

-    const findValidUtf8Length = (bytes) => {
-      // Scan forward, validating each sequence
-      let pos = 0;
-      while (pos < bytes.length) {
-        const seqLength = Bitstring.getUtf8SequenceLength(bytes[pos]);
-        if (
-          seqLength === false ||
-          !Bitstring.isValidUtf8Sequence(bytes, pos, seqLength)
-        )
-          break;
-        pos += seqLength;
-      }
-      return pos;
-    };
+    const findValidUtf8Length = Bitstring.findValidUtf8PrefixLength;

Also applies to: 583-594, 690-701, 795-806
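
To illustrate what the proposed prefix scan returns, here is a self-contained sketch. The two helper functions are stand-ins written from the standard UTF-8 rules, not the PR's Bitstring methods; the only behavior taken from the review is that `getUtf8SequenceLength` returns `false` for a byte that cannot start a sequence.

```javascript
// Stand-ins for the Bitstring helpers; the real implementations in
// assets/js/bitstring.mjs may differ.
const MASKS = [0, 0x7f, 0x1f, 0x0f, 0x07]; // leader payload bits by length
const MINIMUMS = [0, 0, 0x80, 0x800, 0x10000]; // smallest code point by length

function getUtf8SequenceLength(leaderByte) {
  if ((leaderByte & 0x80) === 0x00) return 1;
  if ((leaderByte & 0xe0) === 0xc0) return 2;
  if ((leaderByte & 0xf0) === 0xe0) return 3;
  if ((leaderByte & 0xf8) === 0xf0) return 4;
  return false; // continuation byte or invalid leader
}

function isValidUtf8Sequence(bytes, start, length) {
  if (start + length > bytes.length) return false; // truncated
  let codePoint = bytes[start] & MASKS[length];
  for (let i = 1; i < length; i++) {
    const byte = bytes[start + i];
    if ((byte & 0xc0) !== 0x80) return false; // not a continuation byte
    codePoint = (codePoint << 6) | (byte & 0x3f);
  }
  if (codePoint < MINIMUMS[length]) return false; // overlong
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false; // surrogate
  return codePoint <= 0x10ffff;
}

// Returns the byte length of the longest valid UTF-8 prefix of `bytes`.
function findValidUtf8PrefixLength(bytes) {
  let pos = 0;
  while (pos < bytes.length) {
    const seqLength = getUtf8SequenceLength(bytes[pos]);
    if (seqLength === false || !isValidUtf8Sequence(bytes, pos, seqLength)) break;
    pos += seqLength;
  }
  return pos;
}

const bytes = new TextEncoder().encode("héllo"); // 68 c3 a9 6c 6c 6f
console.log(findValidUtf8PrefixLength(bytes)); // 6
console.log(findValidUtf8PrefixLength(bytes.slice(0, 2))); // 1 (é is cut off)
console.log(findValidUtf8PrefixLength([0x61, 0xff, 0x62])); // 1 (0xff is invalid)
```

The scan stops at the first invalid or truncated sequence, so the return value is always a safe offset at which to split the input.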

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/erlang/unicode.mjs` around lines 326 - 339, Extract the repeated
UTF-8 prefix scanning logic into a new
Bitstring.findValidUtf8PrefixLength(bytes) helper that loops using
Bitstring.getUtf8SequenceLength(byte) and Bitstring.isValidUtf8Sequence(bytes,
pos, seqLength) and returns the valid prefix length; add this helper to
Bitstring (e.g., assets/js/bitstring.mjs) and replace the inline loops found in
characters_to_nfc_binary/1, characters_to_nfd_binary/1,
characters_to_nfkc_binary/1, and characters_to_nfkd_binary/1 (and the other
occurrences around the indicated ranges) with a single call to
Bitstring.findValidUtf8PrefixLength(bytes).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@assets/js/erlang/unicode.mjs`:
- Line 687: Normalize the comment casing for consistency by changing the inline
comment "// scan forward, validating each sequence" to use sentence-style
capitalization (e.g., "// Scan forward, validating each sequence") wherever this
pattern appears in the file; update the specific comment near the Unicode
validation logic in unicode.mjs (the comment currently reading "scan forward,
validating each sequence") so it matches the project's comment casing standard.


@mward-sudo
Contributor Author

@coderabbitai Don't do any further reviews on this PR while it is in draft.

coderabbitai bot commented Feb 19, 2026

@mward-sudo: Understood! I won't perform any further reviews on this PR while it remains in draft status. Once you're ready for a review, you can ask me to resume or change the PR status to "Ready for review."

✅ Actions performed

Reviews paused.

Development

Successfully merging this pull request may close these issues.

Extract UTF-8 sequence validation to Bitstring class

1 participant